Alternative libraries

So far, the two main libraries we used for scraping a webpage were requests and BeautifulSoup. However, there are also alternative libraries that serve the same purpose.

  • urllib2 - the standard-library module in Python 2 for sending requests to a URL and reading the HTML content (in Python 3 the same functionality lives in urllib.request). The two main functions are urlopen() (similar to get() from requests) and read(), called on the response (similar to the text attribute of a requests response).
  • lxml - a third-party library (like BeautifulSoup) used for parsing XML and HTML files. The syntax is very similar to that of BeautifulSoup, yet this library is usually much faster. The disadvantage is that it is best suited to well-formed webpages, not to more or less unstructured ones (not to "soups").
It is worth noting that **lxml** has a soupparser module (**lxml.html.soupparser**), which *"mimics"* the **BeautifulSoup** approach. Conversely, the **BeautifulSoup()** constructor from the library of the same name can take **"lxml"** as its parser argument and use lxml as the parser to scrape websites more quickly.
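As a minimal sketch of the second point, the snippet below hands BeautifulSoup the string "lxml" as its parser. It assumes that both the bs4 and lxml packages are installed; the SAMPLE markup is made up for illustration and is not taken from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to illustrate the parser argument.
SAMPLE = "<html><body><table><tr><td>Store Manager</td></tr></table></body></html>"

# The second argument names the parser; "lxml" delegates the parsing to lxml.
soup = BeautifulSoup(SAMPLE, "lxml")

# The usual BeautifulSoup API is still available on the result.
cells = [td.get_text() for td in soup.find_all("td")]
print(cells)
```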

In [1]:
import urllib2
from lxml import html

In [2]:
url = "https://careercenter.am/ccidxann.php"

In [3]:
response = urllib2.urlopen(url)
page = response.read()

tree = html.document_fromstring(page)
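urllib2 exists only in Python 2. Under Python 3, the same two steps could be sketched roughly as follows, using urllib.request from the standard library. The fetch_tree name and the timeout value are illustrative choices and are not part of the original notebook.

```python
from urllib.request import urlopen  # Python 3 counterpart of urllib2.urlopen

from lxml import html


def fetch_tree(url, timeout=10):
    """Download a page and parse it into an lxml HTML tree."""
    # The response object can be used as a context manager so that the
    # connection is closed once the body has been read.
    with urlopen(url, timeout=timeout) as response:
        page = response.read()  # raw bytes of the page
    return html.document_fromstring(page)
```

Calling fetch_tree(url) with the url defined above should then give the same kind of tree object as the Python 2 cell.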

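The findAll() function from BeautifulSoup is replaced by cssselect() in lxml, which takes a CSS selector inside quotes and returns a list of all matching elements, as follows. (cssselect() relies on the cssselect package, which is installed separately from lxml.)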


In [4]:
tables = tree.cssselect("table")

In [5]:
len(tables)


Out[5]:
5

To get the text content of a tag, call the text_content() method on an element of the list.


In [6]:
tables[-1].text_content()


Out[6]:
'\n    COMPETITIONS\n    \n      \n      Open Tender for Choosing an Organization to Purchase Ink System for ATM Cassettes / Union of Banks of Armenia\n    \n  '
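The difference between text_content() and the plain .text attribute is easy to miss, so here is a small sketch. The fragment is hypothetical (it only reuses a job title from the page); the point is that .text stops at the first nested tag, while text_content() also collects the text of all descendants.

```python
from lxml import html

# Hypothetical fragment: text that is split by a nested <b> tag.
cell = html.fragment_fromstring("<div>Store Manager / <b>Zigzag</b></div>")

direct_text = cell.text          # only the text before the first child tag
full_text = cell.text_content()  # text of the element and all its descendants

print(repr(direct_text))
print(repr(full_text))
```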

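We may use the table attributes to find the table that we are looking for. Each attribute is written inside its own square brackets. Note that separating the bracketed attributes with a comma, as below, selects elements that match any one of them (a logical "or"), and that without a tag name in front the selector matches any element, not only tables. To require all attributes at once, write the brackets back to back with no separator.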


In [7]:
our_table = tree.cssselect('[width="100%"],[border="0"]')

To get the text content of each matched element, we loop over the resulting list and print the text content of every item.


In [8]:
for i in our_table:
    print(i.text_content())







    JOB OPPORTUNITIES
    
      
      IT Project Manager, IT Department / HSBC Bank Armenia
    
    
      
      Medical Representative / Les Laboratoires Servier Armenia
    
    
      
      Marketing Manager / Accurate Group
    
    
      
      Technical Specifications Development Specialist / RA Ministry of Health  Centralized Procurement Organization and Coordination Working Group
    
    
      
      Pharmacologist / RA Ministry of Health  Centralized Procurement Organization and Coordination Working Group
    
    
      
      Procurement Specialist / RA Ministry of Health  Centralized Procurement Organization and Coordination Working Group
    
    
      
      National Consultant/ Translator to Assist in Learning Evaluation Mission / GIZ
    
    
      
      Store Manager / Zigzag
    
    
      
      Business Development Manager / Oriflame Cosmetics
    
    
      
      Warehouse Worker / Oriflame Cosmetics
    
    
      
      Chief Accountant / Alfa-Pharm
    
    
      
      Assistant to Accountant / Noyan Tapan
    
    
      
      Commercial Officer / Brabion Flora Service
    
    
      
      Accountant / D&H Group
    
    
      
      Sales Consultant / Rouge Beaute
    
    
      
      TV Programs Recording and Control Specialist / Public Television Company of Armenia
    
    
      
      Business Intelligence Developer / SFL
    
    
      
      Refrigeration Specialist / TST
    
    
      
      Sales Manager / TST
    
    
      
      Senior C++ Software Engineer / RUSHC
    
    
      
      Senior Verilog Engineer / RUSHC
    
    
      
      Mid-Level Software Quality Assurance Specialist / Webb Fontaine Holding
    
    
      
      Communications Manager / urbanlab
    
    
      
      B2B Marketing Expert / ArmenTel
    
    
      
      Announcements Moderator / Career Center
    
    
      
      Flower Store Administrator / Career Center Partner Company
    
    
      
      Broadcasting Provision Service Specialist / Public Television Company of Armenia
    
    
      
      Head of Sales Department / Karcomauto
    
    
      
      UI Developer / CargoMatrix
    
    
      
      Backend Developer / CargoMatrix
    
    
      
      IT Project Manager / CargoMatrix
    
    
      
      Operational  Manager / Cascade Consultants on Behalf of Bycop
    
    
      
      Dealer/ Portfolio Manager / Cascade Consultants on Behalf of a Client Company
    
    
      
      Corporate Sales Manager, Sevan Branch / Anelik Bank
    
    
      
      Corporate Loan Analyst, Sevan Branch / Anelik Bank
    
    
      
      Agronomist/ Plant Breeder / Agrotech
    
    
      
      Chief Accountant/ Head of Accounting and Tax Department / Tower International Consultants
    
    
      
      Senior System Engineer / HayTech Solutions
    
    
      
      Leading System Administrator / Public Television Company of Armenia
    
    
      
      Senior Accountant / Finlex
    
    
      
      Accountant / Varks.am
    
    
      
      UI/ UX Designer / Sgames
    
    
      
      HR Assistant / Career Center
    
    
      
      Receptionist / Ibis Yerevan Center
    
    
      
      Internal Control Specialist / Varks.am
    
    
      
      Call Center Specialist / Varks.am
    
    
      
      Chief Information Officer/ Head of IT and Automation Division / Ameriabank
    
    
      
      Sales Agent / Opera Suite Hotel
    
    
      
      Event Coordinator / Opera Suite Hotel
    
    
      
      Marketing Coordinator / Opera Suite Hotel
    
    
      
      Node.js Developer / Zangi
    
    
      
      Operator Trainer / Immersive Technologies
    
    
      
      International Sales Manager / VoIPShop Telecommunications
    
    
      
      Digital Marketing Manager, General Directorate / ArmenTel
    
    
      
      Partnerships Manager, General Directorate / ArmenTel
    
    
      
      Senior Software Developer / LSoft
    
    
      
      Junior Software Developer / LSoft
    
    
      
      Economist / Armenian Harvest Promotion Center
    
    
      
      Internal Accountant / Career Center Partner Company
    
    
      
      NOC Engineer / VivaCell-MTS
    
    
      
      Programmer Analyst, Risk Management Department / Anelik Bank
    
    
      
      Corporate Sales Manager, Abovyan Branch / Anelik Bank
    
    
      
      Corporate Sales Manager, Vanadzor Branch / Anelik Bank
    
    
      
      Corporate Loan Analyst, Vanadzor Branch / Anelik Bank
    
    
      
      Client Service Director, Advertising Unit / Publicis Hepta
    
    
      
      Head of Dell Brand Department / M.U.K. Computers
    
    
      
      Administrator / M.U.K. Computers
    
    
      
      Service Engineer / M.U.K. Computers
    
    
      
      Sales Manager / M.U.K. Computers
    
    
      
      Translator / VivaCell-MTS
    
    
      
      Painter / U.S. Embassy Yerevan
    
    
      
      Medical Representative/ Key Account Manager / FIC Medical, Armenia
    
    
      
      Operational Manager / Bycop
    
    
      
      Senior Internal Auditor / Galaxy Concern
    
    
      
      Assistant to Director / Bars
    
    
      
      SMS Network Engineer / Dexatel
    
    
      
      Sales and Marketing Specialist, SM Department / Comfort R&V
    
    
      
      ArmSoft Application Manager / FINCA UCO
    
    
      
      International Sales Manager / Intelcom Line
    
    
      
      Dentist / Telia-Med
    
    
      
      Senior Front-End Developer / Benivo
    
    
      
      Senior Back-End Developer / Benivo
    
    
      
      Program Development Associate / IDeA
    
    
      
      Administrative Clerk/ Chauffeur / U.S. Embassy Yerevan
    
    
      
      Business Advisor / Inecobank
    
    
      
      Full Stack Developer / Baldi Retail
    
    
      
      Credit Information and Collection Officer / Byblos Bank Armenia
    
    
      
      Head of Credit Information and Collection Unit / Byblos Bank Armenia
    
    
      
      Senior PHP/ Magento Developer / CertiPro Solutions
    
    
      
      Product Manager / e-World Systems
    
    
      
      Grants Manager for Bridge4CSOs Project / Armenian General Benevolent Union
    
    
      
      Development and Fundraising Officer / CRRC-Armenia
    
    
      
      Administrative Assistant / SAS Group
    
    
      
      UI/ UX Designer / SFL
    
    
      
      Head of Disaster Management Department / Armenian Red Cross Society
    
    
      
      Data Manager/ IT Specialist / DarmanTest Laboratories
    
    
      
      Graphic Designer / Grand Candy
    
    
      
      Category Manager / Spayka
    
    
      
      Web Content Manager / FXTM Armenia
    
    
      
      Retail Store Manager/ Brand Manager / Guess Armenia
    
    
      
      Sales Consultant / Chronograph Boutique
    
    
      
      Field Engineer / Lydian Armenia
    
    
      
      Maintenance Planning Engineer / Lydian Armenia
    
    
      
      Short-Term Expert for Gap Analysis on Current Internal Control System  / GIZ
    
    
      
      Procurement and Contract Management Specialist / Transport Project Implementation Organization
    
    
      
      UNIX System Adminsitrator / VivaCell-MTS
    
    
      
      RAN Engineer / VivaCell-MTS
    
    
      
      Transmission Engineer / VivaCell-MTS
    
    
      
      Sales Consultant / Mobile Centre
    
    
      
      Assistant Project Manager / BigBek
    
    
      
      Senior Front-End Developer / BigBek
    
    
      
      Graphic Designer / BigBek
    
    
      
      Web Developer / BigBek
    
    
      
      Director of Recruitment, Selection and Matriculation / Teach For Armenia
    
    
      
      HVAC Systems Design Engineer / Consel
    
    
      
      Content Manager / Frismos
    
    
      
      iOS Developer / M&M Media
    
    
      
      Android Developer / M&M Media
    
    
      
      Junior Software Developer / CertiPro Solutions
    
    
      
      Senior Network Administrator / ArmenTel
    
    
      
      Junior iOS Developer / Zangi
    
    
      
      Revenue Assurance Senior Specialist / VivaCell-MTS
    
    
      
      Internet Marketing Specialist / Best Card
    
    
      
      Administrative Assistant, Marketing Department / Aras Food
    
    
      
      Back-End Developer / ApolloBytes
    
    
      
      ReactJS Developer / ApolloBytes
    
    
      
      Director of Sales Department / VAS Group
    
    
      
      Persian Language Specialist / Rosgosstrakh-Armenia
    
    
      
      ISMS Senior Analyst / VivaCell-MTS
    
    
      
      Senior .NET Developer / SouthTech Consulting
    
    
      
      Chief Accountant / Noyan Tapan
    
    
      
      Senior Internal Auditor / FINCA UCO
    
    
      
      Credit Officer / Prometey Bank
    
    
      
      Finance Director / Reso Insurance
    
    
      
      FTTB, ADSL/ VDSL Networks Monitoring Technical Expert / ArmenTel
    
    
      
      Digital Platforms Manager / ArmenTel
    
    
      
      Consultant/ Seller / TST
    
    
      
      Operations Research Developer / Optym Armenia
    
    
      
      Front-End Developer / 4H
    
    
      
      Specialist of Reconciliation Division / ArmSwissBank
    
    
      
      Specialist of Loans Processing and Reporting Division / ArmSwissBank
    
    
      
      Head of Digital Banking / Ameriabank
    
    
      
      Data Analyst / IPSC
    
    
      
      Account Manager, Client Service Department / McCann Erickson
    
    
      
      Digital Marketing Specialist / McCann Erickson
    
    
      
      Medical Representative/ Medical Equipment Specialist / Concern-Energomash
    
    
      
      Receptionist / Envoy Hostel
    
    
      
      Digital Innovations Specialist / Ucom
    
    
      
      Graphic  Designer / Baldi Retail
    
    
      
      Mid-Level Front-End Software Engineer / Aarki
    
    
      
      Junior Front-End Software Engineer / Aarki
    
    
      
      Java Developer / IUNetworks
    
    
      
      JavaScript Developer / IUNetworks
    
    
      
      Media Planner / Media Systems
    
    
      
      Customer Service Specialist / Varks.am
    
    
      
      Sales Assistant / Bass Boutique Hotel
    
    
      
      Technical Support Specialist / Ucom
    
    
      
      Head of Research Center / Darmantest Laboratories
    
    
      
      Senior Full Stack Web Developer / Evolver
    
    
      
      Social Media Marketing Specialist / SAS Group
    
    
      
      Application Security Engineer / Workfront
    
    
      
      Security Operations Analyst / Workfront
    
    
      
      Head of Marketing / Digitain
    
    
      
      HR Manager / Monazite
    
    
      
      Restaurant Manager / Monazite
    
    
      
      Hotel Director / Monazite
    
    
      
      IT Administrator / BDO Armenia
    
    
      
      Ejmiatsin Branch Customer Service Specialist / Varks.am
    
    
      
      Gyumri Branch Customer Service Specialist / Varks.am
    
    
      
      Stepanavan Branch Customer Service Specialist / Varks.am
    
    
      
      Vanadzor Branch Customer Service Specialist / Varks.am
    
    
      
      Armavir Branch Customer Service Specialist / Varks.am
    
    
      
      Abovyan Branch Customer Service Specialist / Varks.am
    
    
      
      Accounts Payable Clerk / MAF Carrefour Armenia
    
    
      
      Marketing Manager /  Aratours Travel Services
    
    
      
      Mine Surveyor / Lydian Armenia
    
    
      
      Field Engineer / Lydian Armenia
    
    
      
      Analyst / Atenk
    
    
      
      Warranty Manager / Nissan Armenia
    
    
      
      Customer Care and Sales Representative / e-World Systems
    
    
      
      Sales and Service Specialist / Ucom
    
    
      
      Database Developer / Armeconombank
    
    
      
      Web Development Team Manager / Web Projects
    
    
      
      Back Office Officer / FXTM Armenia
    
    
      
      Chemist/ Microbiologist / Jermuk International Pepsi-Cola Bottler
    
    
      
      QA Engineer / Zangi
    
    
      
      Chef / Brand Group
    
    
      
      French Language Specialist / AKKE Trading
    
    
      
      Brand Manager / Rouge Beaute
    
    
      
      Travel Consultant / D&V Travel
    
    
      
      Java Software Developer / HS International
    
    
      
      .NET Service Engineer/ Analyst / HS International
    
    
      
      Financial Specialist / SAS Group
    
    
      
      Logistics Support Specialist / Reload Freight Systems
    
    
      
      Business Development Manager / Euro Truck
    
    
      
      Senior Software Engineer / RUSHC
    
    
      
      IT Specialist / Unitech
    
  


    INTERNSHIPS
    
      
      Branch Intern / HSBC Bank Armenia
    
    
      
      Contact Center Intern / HSBC Bank Armenia
    
  



    TRAININGS
    
      
      English Language Courses / Career Center
    
  


    NEWS
    
      
      English Language Courses for Schoolchildren / Career Center
    
  


    COMPETITIONS
    
      
      Open Tender for Choosing an Organization to Purchase Ink System for ATM Cassettes / Union of Banks of Armenia
    
  
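The printed text above is mostly whitespace. One way to tidy it, sketched here under the assumption that each entry sits in its own table cell (the real page structure may differ), is to collapse the whitespace cell by cell. The SNIPPET markup is hypothetical and only reuses entries from the output.

```python
from lxml import html

# Hypothetical markup that imitates the whitespace-heavy layout of the output.
SNIPPET = """
<html><body><table>
  <tr><td>
    INTERNSHIPS
  </td></tr>
  <tr><td>
      Branch Intern / HSBC Bank Armenia
  </td></tr>
  <tr><td>
      Contact Center Intern / HSBC Bank Armenia
  </td></tr>
</table></body></html>
"""
tree = html.document_fromstring(SNIPPET)

# split() with no arguments drops every run of whitespace, including
# newlines; joining with a single space restores a readable line.
rows = [" ".join(td.text_content().split()) for td in tree.xpath("//td")]
rows = [row for row in rows if row]  # skip cells that were empty

for row in rows:
    print(row)
```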

One advantage of the lxml library is that it provides two options for scraping: 1) CSS selectors (similar to BeautifulSoup) and 2) XPath. The latter is not supported by BeautifulSoup, yet can sometimes be quite handy. XPath is the navigation language for XML documents (which lxml is meant for working with). In XPath, the forward slash (/) is used to define the path to an element, and the at sign (@) inside square brackets ([ ]) is used to define an attribute condition. The //table path selects every table in the document; indexing the returned list then picks one of them, here the last.


In [9]:
tree.xpath('//table')[-1].text_content()


Out[9]:
'\n    COMPETITIONS\n    \n      \n      Open Tender for Choosing an Organization to Purchase Ink System for ATM Cassettes / Union of Banks of Armenia\n    \n  '
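A short sketch of how //table behaves, again on invented markup rather than the real page: the path returns a list of all tables, and either Python indexing or an XPath position can pick one of them.

```python
from lxml import html

# Hypothetical page with three tables.
PAGE = """
<html><body>
  <table><tr><td>JOB OPPORTUNITIES</td></tr></table>
  <table><tr><td>INTERNSHIPS</td></tr></table>
  <table><tr><td>COMPETITIONS</td></tr></table>
</body></html>
"""
tree = html.document_fromstring(PAGE)

all_tables = tree.xpath("//table")  # list of every <table> in the document

# Python indexing on the list ...
last_by_index = all_tables[-1].text_content().strip()

# ... or XPath positions (XPath counts from 1; the parentheses make the
# position apply to the whole result rather than to each parent separately).
first_by_xpath = tree.xpath("(//table)[1]")[0].text_content().strip()
last_by_xpath = tree.xpath("(//table)[last()]")[0].text_content().strip()

print(len(all_tables), first_by_xpath, last_by_xpath)
```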

To find the tables that have a border attribute with a value of 0, the following approach should be used (the [-1] index again takes the last match).


In [10]:
tree.xpath('//table[@border="0"]')[-1].text_content()


Out[10]:
'\n    COMPETITIONS\n    \n      \n      Open Tender for Choosing an Organization to Purchase Ink System for ATM Cassettes / Union of Banks of Armenia\n    \n  '
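Attribute conditions can also be combined inside one pair of square brackets with the XPath "and" operator. The following sketch uses made-up markup to show the difference between one condition and two.

```python
from lxml import html

# Hypothetical page: two tables have border="0", one of them also has a width.
PAGE = """
<html><body>
  <table border="0" width="100%"><tr><td>JOB OPPORTUNITIES</td></tr></table>
  <table border="1"><tr><td>INTERNSHIPS</td></tr></table>
  <table border="0"><tr><td>COMPETITIONS</td></tr></table>
</body></html>
"""
tree = html.document_fromstring(PAGE)

# One condition: every table whose border attribute equals "0".
borderless = tree.xpath('//table[@border="0"]')

# Two conditions joined with "and": both attributes must match.
wide_borderless = tree.xpath('//table[@border="0" and @width="100%"]')

print([t.text_content().strip() for t in borderless])
print([t.text_content().strip() for t in wide_borderless])
```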

If one is interested in getting the value of an attribute (similar to get() in BeautifulSoup), then @ without square brackets can be used after the / sign as follows:


In [11]:
tree.xpath('//table/@border')


Out[11]:
['0', '0', '0', '0', '0']
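lxml elements also have a get() method of their own, so the same values can be read element by element. The sketch below compares the two routes on hypothetical markup; note that /@border skips elements without the attribute, whereas get() returns None for them.

```python
from lxml import html

# Hypothetical page: the second table has no border attribute at all.
PAGE = """
<html><body>
  <table border="0"><tr><td>JOB OPPORTUNITIES</td></tr></table>
  <table><tr><td>INTERNSHIPS</td></tr></table>
  <table border="1"><tr><td>COMPETITIONS</td></tr></table>
</body></html>
"""
tree = html.document_fromstring(PAGE)

# XPath route: only the attribute values that actually exist.
values_xpath = tree.xpath("//table/@border")

# Element route: one entry per table, None where the attribute is missing.
values_get = [table.get("border") for table in tree.xpath("//table")]

print(values_xpath)
print(values_get)
```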